{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sample, Explore, and Clean Taxifare Dataset \n",
    "\n",
    "**Learning Objectives**\n",
    "- Practice querying BigQuery\n",
    "- Sample from large dataset in a reproducible way\n",
    "- Practice exploring data using Pandas\n",
    "- Identify corrupt data and clean accordingly\n",
    "\n",
    "## Introduction \n",
    "In this notebook, we will explore a dataset corresponding to taxi rides in New York City to build a Machine Learning model that estimates taxi fares. The idea is to suggest a likely fare to taxi riders so that they are not surprised, and so that they can protest if the charge is much higher than expected. Such a model would also be useful for ride-hailing apps that quote you the trip price in advance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set up environment variables and load necessary libraries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check that the Google BigQuery library is installed and if not, install it. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip freeze | grep google-cloud-bigquery==1.21.0 || pip install google-cloud-bigquery==1.21.0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After installation, restart the kernel by selecting **Kernel** > **Restart Kernel**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT = \"cloud-training-demos\"  # Replace with your PROJECT\n",
    "REGION = \"us-central1\"            # Choose an available region for Cloud MLE"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "os.environ[\"PROJECT\"] = PROJECT\n",
    "os.environ[\"REGION\"] = REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%load_ext google.cloud.bigquery"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## View data schema and size\n",
    "\n",
    "Our dataset is hosted in [BigQuery](https://cloud.google.com/bigquery/): Google's petabyte scale, SQL queryable, fully managed cloud data warehouse. It is a publically available dataset, meaning anyone with a GCP account has access. \n",
    "\n",
    "1. Click [here](https://console.cloud.google.com/bigquery?project=bigquery-public-data&p=nyc-tlc&d=yellow&t=trips&page=table) to acess the dataset.\n",
    "2. In the web UI, below the query editor, you will see the schema of the dataset. What fields are available, what does each mean? \n",
    "3. Click the 'details' tab. How big is the dataset?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preview data\n",
    "\n",
    "Let's see what a few rows of our data looks like. Any cell that starts with `%%bigquery` will be interpreted as a SQL query that is executed on BigQuery, and the result is printed to our notebook.\n",
    "\n",
    "BigQuery supports [two flavors](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#comparison_of_legacy_and_standard_sql) of SQL syntax: legacy SQL and standard SQL. The preferred is standard SQL because it complies with the official SQL:2011 standard. To instruct BigQuery to interpret our syntax as such we start the query with `#standardSQL`.\n",
    "\n",
    "There are over 1 Billion rows in this dataset and it's 130GB large, so let's retrieve a small sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "#standardSQL\n",
    "SELECT\n",
    "    *\n",
    "FROM\n",
    "    `nyc-tlc.yellow.trips`\n",
    "WHERE\n",
    "    RAND() < .0000001 -- sample a small fraction of the data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preview data (alternate way)\n",
    "\n",
    "Alternatively we can use BigQuery's web UI to execute queries. \n",
    "\n",
    "1. Open the [web UI](https://console.cloud.google.com/bigquery)\n",
    "2. Paste the above query minus the `%%bigquery` part into the Query Editor\n",
    "3. Click the 'Run' button or type 'CTRL + ENTER' to execute the query\n",
    "\n",
    "Query results will be displayed below the Query editor."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sample data repeatably\n",
    "\n",
    "There's one issue with using `RAND() < N` to sample. It's non-deterministic. Each time you run the query above you'll get a different sample. \n",
    "\n",
    "Since repeatability is key to data science, let's instead use a hash function (which is deterministic by definition) and then sample the using the modulo operation on the hashed value. \n",
    "\n",
    "We obtain our hash values using:\n",
    "\n",
    "`FARM_FINGERPRINT(CAST(hashkey AS STRING))`\n",
    "\n",
    "Working from inside out:\n",
    "\n",
    "- `CAST()`: Casts hashkey to string because our hash function only works on strings\n",
    "- `FARM_FINGERPRINT()`: Hashes strings to 64bit integers\n",
    "\n",
    "\n",
    "The `hashkey` should be:\n",
    "\n",
    "1. Unrelated to the objective \n",
    "2. Sufficiently high cardinality\n",
    "\n",
    "Given these properties we can sample our data repeatably using the modulo operation. \n",
    "\n",
    "To get a 1% sample:\n",
    "\n",
    "`WHERE ABS(MOD(hashvalue, 100)) = 0`\n",
    "\n",
    "To get a *different* 1% sample change the remainder condition, for example:\n",
    "\n",
    "`WHERE ABS(MOD(hashvalue, 100)) = 55`\n",
    "\n",
    "To get a 20% sample:\n",
    "\n",
    "`WHERE ABS(MOD(hashvalue, 100)) < 20` Alternatively: `WHERE ABS(MOD(hashvalue, 5)) = 0`\n",
    "\n",
    "And so forth...\n",
    "\n",
    "We'll use `pickup_datetime` as our hash key because it meets our desired properties. If such a column doesn't exist in the data you can synthesize a hashkey by concatenating multiple columns.\n",
    "\n",
    "Below we sample 1/5000th of the data. The syntax is admittedly less elegant than `RAND() < N`, but now each time you run the query you'll get the same result.\n",
    "\n",
    "\\**Tech note: Taking absolute value doubles the chances of hash collisions but since there are 2^64 possible hash values and less than 2^30 hash keys the collision risk is negligable.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "#standardSQL\n",
    "SELECT\n",
    "    *\n",
    "FROM\n",
    "    `nyc-tlc.yellow.trips`\n",
    "WHERE \n",
    "    -- repeatable 1/5000th sample\n",
    "    ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 5000)) = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load sample into Pandas dataframe\n",
    "\n",
    "The advantage of querying BigQuery directly as opposed to the web UI is that we can supplement SQL analysis with Python analysis. A popular Python library for data analysis on structured data is [Pandas](https://pandas.pydata.org/), and the primary data strucure in Pandas is called a DataFrame.\n",
    "\n",
    "To store BigQuery results in a Pandas DataFrame we have have to query the data with a slightly differently syntax.\n",
    "\n",
    "1. Import the `google.cloud` `bigquery` module\n",
    "2. Create a variable called `bq` which is equal to the BigQuery Client `bigquery.Client()`\n",
    "2. Store the desired SQL query as a Python string\n",
    "3. Execute `bq.query(query_string).to_dataframe()` where `query_string` is what you created in the previous step\n",
    "\n",
    "**This will take about a minute**\n",
    "\n",
    "*Tip: Use triple quotes for a multi-line string in Python*\n",
    "\n",
    "*Tip: You can measure execution time of a cell by starting that cell with `%%time`*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.cloud import bigquery\n",
    "bq = bigquery.Client(project=PROJECT)\n",
    "\n",
    "query_string = \"\"\"\n",
    "#standardSQL\n",
    "SELECT\n",
    "    *\n",
    "FROM\n",
    "    `nyc-tlc.yellow.trips`\n",
    "Where\n",
    "    -- repeatable 1/5000th sample\n",
    "    ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 5000)) = 1\n",
    "\"\"\"\n",
    "\n",
    "trips = bq.query(query_string).to_dataframe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore datafame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(type(trips))\n",
    "trips.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Python variable `trips` is now a Pandas DataFrame. The `.head()` function above prints the first 5 rows of a DataFrame. \n",
    "\n",
    "The rows in the DataFrame may be in a different order than when using `%%bigquery`, but the data is the same.\n",
    "\n",
    "It would be useful to understand the distribution of each of our columns, which is to say the mean, min, max, standard deviation etc..\n",
    "\n",
    "A DataFrame's `.describe()` method provides this. By default it only analyzes numeric columns. To include stats about non-numeric column use `describe(include='all')`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trips.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Distribution analysis\n",
    "\n",
    "Do you notice anything off about the data? Pay attention to `min` and `max`. Latitudes should be between -90 and 90, and longitudes should be between -180 and 180, so clearly some of this data is bad.\n",
    "\n",
    "Further more some trip fares are negative and some passenger counts are 0 which doesn't seem right. We'll clean this up later."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Investigate trip distance\n",
    "\n",
    "Looks like some trip distances are 0 as well, let's investigate this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trips[trips[\"trip_distance\"] == 0][:10] # first 10 rows with trip_distance == 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It appears that trips are being charged substantial fares despite having 0 distance. \n",
    "\n",
    "Let's graph `trip_distance` vs `fare_amount` using the Pandas [`.plot()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html) method to corroborate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "trips.plot(x = \"trip_distance\", y = \"fare_amount\", kind = \"scatter\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It appears that we have a lot of invalid data that is being coded as zero distance and some fare amounts that are definitely illegitimate. Let's remove them from our analysis. We can do this by modifying the BigQuery query to keep only trips longer than zero miles and fare amounts that are at least the minimum cab fare ($2.50)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Identify correct label\n",
    "\n",
    "Should we use `fare_amount` or `total_amount` as our label? What's the difference?\n",
    "\n",
    "To make this clear let's look at some trips that included a toll."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "trips[trips[\"tolls_amount\"] > 0][:10] # first 10 rows with toll_amount > 0 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the samples above, we can see that the total amount reflects fare amount, toll and tip somewhat arbitrarily -- this is because when customers pay cash, the tip is not known. In any case tips are discretionary and shoud not be included in our fare estimation tool.\n",
    "\n",
    "So, we'll use the sum of `fare_amount` + `tolls_amount` as our label."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Select useful fields\n",
    "\n",
    "What fields do you see that may be useful in modeling taxifare? They should be\n",
    "\n",
    "1. Related to the objective\n",
    "2. Available at prediction time \n",
    "\n",
    "**Related to the objective**\n",
    "\n",
    "For example we know `passenger_count` shouldn't have any affect on fare because fare is calculated by time and distance. Best to eliminate it to reduce the amount of noise in the data and make the job of the ML algorithm easier.\n",
    "\n",
    "If you're not sure whether a column is related to the objective, err on the side of keeping it and let the ML algorithm figure out whether it's useful or not. \n",
    "\n",
    "**Available at prediction time** \n",
    "\n",
    "For example `trip_distance` is certainly related to the objective, but we can't know the value until a trip is completed (depends on the route taken), so it can't be used for prediction. \n",
    "\n",
    "**We will use the following** \n",
    "\n",
    "`pickup_datetime`, `pickup_longitude`, `pickup_latitude`, `dropoff_longitude`, and `dropoff_latitude`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clean the data\n",
    "\n",
    "We need to do some clean-up of the data:\n",
    "\n",
    "- Filter to latitudes and longitudes that are reasonable for NYC\n",
    "- We shouldn't fare amounts < 2.50\n",
    "- Trip distances and passenger counts should be non-zero\n",
    "- Have the label reflect the sum of fare_amount and tolls_amount\n",
    "\n",
    "Let's change the BigQuery query appropriately, and only return the fields we'll use in our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery --project $PROJECT\n",
    "#standardSQL\n",
    "SELECT\n",
    "    (tolls_amount + fare_amount) AS fare_amount, -- label\n",
    "    pickup_datetime,\n",
    "    pickup_longitude, \n",
    "    pickup_latitude, \n",
    "    dropoff_longitude, \n",
    "    dropoff_latitude\n",
    "FROM\n",
    "    `nyc-tlc.yellow.trips`\n",
    "WHERE\n",
    "    -- Clean Data\n",
    "    trip_distance > 0\n",
    "    AND passenger_count > 0\n",
    "    AND fare_amount >= 2.5\n",
    "    AND pickup_longitude > -78\n",
    "    AND pickup_longitude < -70\n",
    "    AND dropoff_longitude > -78\n",
    "    AND dropoff_longitude < -70\n",
    "    AND pickup_latitude > 37\n",
    "    AND pickup_latitude < 45\n",
    "    AND dropoff_latitude > 37\n",
    "    AND dropoff_latitude < 45\n",
    "    -- repeatable 1/5000th sample\n",
    "    AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 5000)) = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now have a repeatable and clean sample we can use for modeling taxi fares. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright 2022 Google Inc.\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with the License. You may obtain a copy of the License at\n",
    "http://www.apache.org/licenses/LICENSE-2.0\n",
    "Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
